Skip to content

feat: add health monitoring and health check management#2

Merged
cersho merged 2 commits intomainfrom
cersho/feat-health-check
Mar 21, 2026
Merged

feat: add health monitoring and health check management#2
cersho merged 2 commits intomainfrom
cersho/feat-health-check

Conversation

@cersho
Copy link
Copy Markdown
Owner

@cersho cersho commented Mar 21, 2026

  • Implemented a health monitoring system with a new Monitor struct to manage health checks for database connections.
  • Added functionality to probe connections for various database types, including PostgreSQL, MySQL, Convex, and Cloudflare D1.
  • Created a health check section in the UI to display the status of health checks, including the ability to configure settings and view notifications.
  • Enhanced the app's HTML template to include health check indicators and a detailed view of affected databases.

Summary by cubic

Adds automated health monitoring for database connections with a background monitor, API and UI to manage checks and notifications, plus a header status indicator with a down-database popover.

  • New Features

    • Background health monitor polls due checks with configurable interval, batch size, and concurrency; ensures defaults on startup.
    • Auto-creates a health check for each connection on creation using configured interval/timeout and pass/fail thresholds.
    • API: /health-checks (list/get), /health-checks/{id} (PATCH settings), /health-checks/{id}/run (manual run with cooldown), /health-checks/{id}/notifications (list/set bindings).
    • Manual run enforces cooldown and responds with 429 + Retry-After when invoked too soon.
    • Notifications fire on "down" and "recovered"; new destinations auto-attach to all checks; per-check bindings can toggle Down/Recovered.
    • Probing: TCP for PostgreSQL/MySQL, endpoint check for Convex, and a lightweight query for Cloudflare D1.
    • UI: Health Checks page to view status, edit settings, manage bindings, and “Run now”; header shows overall health with a popover listing affected databases.
    • Cleanup: deleting a connection removes its health checks and bindings; deleting a notification removes its health bindings.
  • Migration

    • New tables: health checks and health check notifications (auto-migrated).
    • Config (safe defaults): HEALTH_POLL_INTERVAL, HEALTH_DUE_BATCH_SIZE, MAX_CONCURRENT_HEALTH_CHECKS, DEFAULT_HEALTH_CHECK_INTERVAL, DEFAULT_HEALTH_CHECK_TIMEOUT, DEFAULT_HEALTH_FAILURE_THRESHOLD, DEFAULT_HEALTH_SUCCESS_THRESHOLD, HEALTH_MANUAL_RUN_COOLDOWN.
    • For D1 probing, set CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_DATABASE_ID, CLOUDFLARE_API_KEY (optional D1_API_BASE_URL). Existing connections are seeded with checks on startup.

Written for commit f00d088. Summary will update on new commits.

- Implemented a health monitoring system with a new Monitor struct to manage health checks for database connections.
- Added functionality to probe connections for various database types, including PostgreSQL, MySQL, Convex, and Cloudflare D1.
- Created a health check section in the UI to display the status of health checks, including the ability to configure settings and view notifications.
- Enhanced the app's HTML template to include health check indicators and a detailed view of affected databases.
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 13 files

Confidence score: 3/5

  • There is concrete regression risk in internal/health/monitor.go: Stop() can return before runCheck goroutines finish, so background work may still touch m.repo/m.notifier during shutdown.
  • internal/repository/repository.go has a consistency gap: notification creation and health-check binding are committed separately, so a failure in attachNotificationToAllHealthChecks can leave partially persisted state while returning an error.
  • internal/notifications/dispatcher.go has a user-facing formatting issue where an empty Connection.Name can produce a blank name in message bodies (subject has fallback, body does not), which is lower severity but still noticeable.
  • Pay close attention to internal/health/monitor.go, internal/repository/repository.go, and internal/notifications/dispatcher.go - shutdown synchronization, transactional consistency, and message fallback behavior need validation.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/notifications/dispatcher.go">

<violation number="1" location="internal/notifications/dispatcher.go:312">
P2: Missing fallback for empty `Connection.Name` in the message body. `buildHealthSMTPSubject` falls back to `check.ConnectionID` when the name is empty, but `formatHealthMessage` writes the name directly, producing a blank `- Connection: ` line.</violation>
</file>

<file name="internal/health/monitor.go">

<violation number="1" location="internal/health/monitor.go:70">
P1: **`Stop()` returns before in-flight health-check goroutines finish.** Goroutines spawned inside `processDue` are not tracked, so `Stop()` can return while `runCheck` goroutines are still using `m.repo` and `m.notifier`. Use a `sync.WaitGroup` to track spawned goroutines and `wg.Wait()` inside `Stop()` after the ticker loop exits.</violation>
</file>

<file name="internal/repository/repository.go">

<violation number="1" location="internal/repository/repository.go:313">
P2: The notification create and the health-check binding create are two independent commits. If `attachNotificationToAllHealthChecks` fails, the notification is already persisted but the caller gets an error, leaving the system in an inconsistent state. Wrap both operations in a single transaction, similar to how `DeleteNotification` handles multi-table changes.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant UI as Client (UI/API)
    participant HMon as NEW: Health Monitor (Background)
    participant Repo as Repository (GORM)
    participant Probe as NEW: Connection Prober
    participant DB as Target DB (PG/MySQL/D1/Convex)
    participant Notif as Dispatcher (SMTP/Discord)

    Note over HMon, Repo: Periodic Background Loop
    loop Every HEALTH_POLL_INTERVAL
        HMon->>Repo: ListDueHealthChecks(now, batchSize)
        Repo-->>HMon: HealthCheck metadata + Connection
        
        loop For each due check
            HMon->>Repo: NEW: ClaimHealthCheckRun(id, lease)
            Note right of Repo: Prevents concurrent runs
            Repo-->>HMon: Claim Success/Failure
            
            opt Claimed successfully
                HMon->>Probe: NEW: ProbeConnection(connection, timeout)
                
                alt TCP (Postgres/MySQL)
                    Probe->>DB: Dial TCP Reachability
                else API (Cloudflare D1)
                    Probe->>DB: POST /query (SELECT 1)
                else Endpoint (Convex)
                    Probe->>DB: HTTPS reachability
                end
                
                DB-->>Probe: Result (Success/Error)
                Probe-->>HMon: ProbeResult (Healthy/Unhealthy)
                
                HMon->>Repo: NEW: SaveHealthCheckProbeResult(id, result)
                Note right of Repo: Increments consecutive fail/success counts
                
                alt NEW: Threshold met (Status change)
                    Repo-->>HMon: Return Event (DOWN or RECOVERED)
                    HMon->>Notif: NEW: NotifyHealthCheckEvent(check, event)
                    Notif->>Notif: Filter by Binding (OnDown/OnRecovered)
                    Notif->>UI: External Notification (Discord/Email)
                else No status change
                    Repo-->>HMon: Return no event
                end
            end
        end
    end

    Note over UI, Probe: Manual Trigger Path
    UI->>Repo: NEW: POST /health-checks/{id}/run
    Repo->>Repo: Check NEW: HEALTH_MANUAL_RUN_COOLDOWN
    
    alt Cooldown Active
        Repo-->>UI: 429 Too Many Requests (Retry-After)
    else Cooldown Expired
        Repo->>Probe: NEW: ProbeConnection(...)
        Probe->>DB: Execute Probe
        DB-->>Probe: Result
        Probe-->>UI: Return immediate health status
    end

    Note over UI, Repo: Connection Lifecycle
    UI->>Repo: Create NEW Connection
    Repo->>Repo: Save Connection Record
    Repo->>Repo: NEW: UpsertHealthCheckForConnection(...)
    Note right of Repo: Seeds default intervals/thresholds from Config
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@cersho cersho merged commit 1c1d287 into main Mar 21, 2026
5 checks passed
@cersho cersho deleted the cersho/feat-health-check branch March 21, 2026 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant